ABSTRACT

Some type of difference or change score is frequently used to quantify
the effects of experimental treatments and educational programs on
individuals and on groups of individuals. Two studies investigated the
tenability of the assumption that classroom instruction results in
increases in students' achievement levels while the qualitative nature
of that achievement remains constant across time. The data utilized
were the item responses to tests in basic mathematics and in general
biology administered as pretests and after instruction to students
enrolled in those courses. Results indicated that this assumption was
not tenable in the biology data set, where increases in mean
achievement level were accompanied by corresponding changes in the
factor structure underlying the item responses. For the mathematics
data, however, there was no such violation of the assumption; as
student achievement levels increased, the underlying factor structure
remained unchanged. The implications of these results for psychology,
education, and program evaluation are noted. (Author/GK)
Dimensionality of Measured Achievement Over Time

The measurement of individual or group change is central to many issues in
the fields of psychology, education, and program evaluation. Psychologists,
educators, and (more recently) evaluators typically use differences in test
scores to quantify the effects of experimental treatments and educational
programs on individuals and on groups of individuals.

The typical paradigm for measuring change involves the administration of a
standardized achievement test both before and after an experimental treatment
or program implementation; the effect of the treatment intervention is then
considered to be a function of the mean difference between the two sets of
test scores. If two or more groups of students are involved, comparisons can
also be made between treatment and control groups, or among groups exposed to
various treatments or involved in several different programs. Again,
evaluation of treatment effects involves comparing the mean achievement gain
(typically, a function of the difference scores) observed for each group.
Individual gain or change is also frequently used to measure an individual's
growth in achievement level or change due to a treatment or special program.

Lord (1963) and Cronbach and Furby (1970), among others, have discussed the
methodological and statistical problems involved in using difference scores to
measure change or growth and have presented some possible solutions. Whether
measurements of change involve the use of simple difference scores, their
derivatives, or some more complex methodological design, the measurement
process itself assumes that the treatment or instruction results in increased
levels of the same trait or characteristic that was measured originally and
that the only change that occurs is a quantitative one.

That this assumption may be violated has long been evident in studies of
intelligence and intellectual growth. Garrett (1946) noted that "intelligence
changes in its organization" (p. 373) and called for corresponding changes in
the way intelligence is measured. This "differentiation hypothesis" spawned
much research (see Reinert, 1970, for a review) concerning the changes in the
structure and organization of intelligence throughout the human life span.
Some of these studies report results supporting the hypothesis of age
differentiation; others offer support for a hypothesis of age integration, and
still others provide evidence in support of both these hypotheses. Nearly all
this research, however, has found that the structure of intelligence, as
defined by factor analysis, does not remain constant with age and experience.

Other authors (Anastasi, 1936; Ferguson, 1954; Games, 1962; Woodrow, 1938,
1939a, 1939b, 1939c) have investigated the changes in verbal ability and
intellectual factor structure that accompany shorter term training and
practice. Similar factor-analytic investigations have been made in the areas
of psychomotor behavior (Fleishman, 1951, 1957, 1960; Fleishman & Hempel,
1954, 1955; Greene, 1943), psycholinguistic abilities (Querishi, 1967), word
association (Sullivan & Moran, 1967; Swartz & Moran, 1968), and even the
learning of Morse code (Fleishman & Fruchter, 1960). All these authors have
found that the factorial structure of abilities underlying task performance
changes in a systematic way with training and practice. An individual's status
at a later point in time, then, may be qualitatively different from his/her
status as originally measured.

Wohlwill (1970) discusses this issue of quantitative versus qualitative
change more generally in the area of developmental psychology and, like Garrett
(1946), calls for more sophisticated scaling methods which will

     ... allow us to assess an individual's status on a developmental
     dimension in a manner such as to ensure not only comparability of content
     for the different parts of that dimension, but at the same time a
     continuous scale along which developmental change can be charted ....
     Postulating a unitary dimension across the age span under investigation
     presupposes that there are no major discontinuities in the development
     of the behavior in question, such as there obviously are in the
     assessment of intelligence when we move from infancy to childhood.
     (p. 154)

Although Reinert (1970) called for the investigation of possible
factor-structure changes in areas other than intelligence and abilities more
than a decade ago, no research has yet extended this line of questioning into
the area of classroom achievement. That is, there have been no reported
studies that have systematically investigated whether the individual and group
changes that occur after classroom instruction or program participation are
quantitative changes in the level of achievement, as is generally assumed, or
whether more qualitative changes in the structure of the achievement variable
have occurred.

Kingsbury and Weiss (1979) studied the effects of testing students at
different points in instruction. They reported that the single factor
extracted from the item responses to a college general biology examination
administered on the first day of class and the factor extracted from the item
responses to a classroom midquarter examination differed markedly from each
other in terms of strength; however, they could not further investigate the
similarity of the factor pattern loadings from both administrations. They
cautioned that replications of their findings contrasting the pretest factor
with the later achievement factor would render difference scores "completely
useless" as indicators of achievement level growth, since different variables
would, in fact, be measured at the two points in time.

The importance of such a conclusion should not be underestimated. If
different characteristics are, in fact, being measured at two different
occasions, then the computation of any type of difference score is
inappropriate and the evaluation of program effectiveness and gains in
individual student achievement must be made on some other basis. It is
justifiable to use difference scores (statistical and methodological issues
notwithstanding) only when it can be demonstrated that quantitative changes
are the only changes accompanying instruction.

Purpose

The objectives of the present studies were to investigate the nature of the
changes in the dimensionality of achievement that occurred following
instruction in two different achievement domains (basic mathematics and
general biology) and to determine the appropriateness of calculating
difference scores in order to measure change in these domains.
STUDY I

Subjects and Tests

Data were obtained from students enrolled in mathematics classes at the
University of Minnesota's General College during the fall quarter of 1979.
These students were administered a 35-item Arithmetic Placement Test (APT) on
the first day of class (pretest) and again as a final examination (posttest).
The APT is composed of five-alternative multiple-choice items covering such
topics as addition, subtraction, multiplication, and division of whole
numbers, fractions, decimals, and percents.

Item responses were coded as correct, incorrect, or missing for the 259
students. However, only 136 of the students answered every item on the APT on
both occasions; i.e., 123 students omitted or did not reach at least one item
on either occasion. In many cases, clusters of items were omitted in the
middle of the tests, which implied that students were omitting the groups of
items for which they did not know the answers, rather than reaching a time
limit for the test. To deal with this problem of missing data, a
15%-missing-data criterion was employed. A student's response protocol was
deleted from the data set if the student omitted more than five items (i.e.,
15% of 35 items) on either the pretest or the posttest. This resulted in a
group of 220 students on which all further analyses were based. For these 220
students, missing data were coded as incorrect on the assumption that the
student did not answer the item because he/she did not know the answer and was
unwilling to guess.
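The screening rule just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the dictionary layout of a response protocol, and the use of `None` for a missing response are hypothetical conventions chosen here, not details taken from the study.

```python
def apply_missing_data_criterion(protocols, max_fraction=0.15):
    """Drop protocols with too many omissions, then score omissions as incorrect.

    Each protocol is assumed (hypothetically) to be a dict with "pretest" and
    "posttest" lists whose entries are 1 (correct), 0 (incorrect), or None
    (missing). A protocol is deleted if the number of omissions on either
    occasion exceeds max_fraction of the items (more than 5 of 35 at 15%).
    """
    occasions = ("pretest", "posttest")
    retained = []
    for protocol in protocols:
        too_many_omissions = any(
            sum(r is None for r in protocol[occ]) > max_fraction * len(protocol[occ])
            for occ in occasions
        )
        if too_many_omissions:
            continue
        # Remaining omissions are recoded as incorrect (0).
        retained.append(
            {occ: [0 if r is None else r for r in protocol[occ]] for occ in occasions}
        )
    return retained
```

With 35 items the threshold is 5.25, so a protocol with five omissions on one occasion is kept and a protocol with six is dropped.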

Analyses

Differences in achievement level estimates. The question of interest with
respect to achievement level estimates was whether there were differences in
achievement level estimates due to instruction; i.e., were students growing or
gaining in achievement levels throughout the course of instruction? Analyses
pertinent to this question included comparison of the frequency distributions
of number-correct scores both before and after instruction and a t test for
the difference between the means of scores on the pretest and the posttest.
Comparisons were also made of the distributions of item difficulties for each
administration of the APT. The correlation between scores on the pretest and
posttest was computed as an indication of the degree to which the scores were
linearly related.
Differences in the structure of achievement. A related but less often
investigated issue is whether there were changes in the structure of item
responses due to instruction. Investigation of this issue involved computing
and comparing the values of coefficient alpha as an index of internal
consistency, which is related to the average level of intercorrelation of the
items. More germane to this issue, however, was whether the factor structure
underlying the test changed with instruction or whether it remained constant.
Consequently, principal axes factor analyses were performed separately on the
pretest and posttest item responses. Pearson product-moment correlations were
computed between pairs of item responses, and the diagonal elements of the
interitem correlation matrices were replaced with initial estimates of the
communalities of each item, as given by the squared multiple correlation
between that item and the other items in the matrix. An iterative procedure
for improving these communality estimates was used, successively extracting
factors and re-estimating the communalities. This process continued until the
difference between two successive communality estimates was negligible (see
Nie, Hull, Jenkins, Steinbrenner, & Bent, 1975).
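The iterative communality procedure can be illustrated for the single-factor case with a short standard-library sketch. It is not the SPSS routine cited above: the function names are hypothetical, the largest eigenpair is found by simple power iteration, and only one factor is extracted. The steps match the description, though: squared multiple correlations as starting communalities, extraction from the reduced correlation matrix, and re-estimation until the communalities stop changing.

```python
import math

def invert(matrix):
    """Gauss-Jordan inverse with partial pivoting (assumes a nonsingular matrix)."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def first_eigenpair(matrix, iters=500):
    """Largest eigenvalue and unit eigenvector by power iteration."""
    n = len(matrix)
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iters):
        w = [sum(m * x for m, x in zip(row, v)) for row in matrix]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    value = sum(v[i] * sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n))
    return value, v

def principal_axis_one_factor(R, tol=1e-6, max_iter=200):
    """Iterated principal axis factoring of correlation matrix R, one factor."""
    n = len(R)
    R_inv = invert(R)
    # Initial communalities: squared multiple correlation of each item
    # with all the other items.
    h2 = [1.0 - 1.0 / R_inv[i][i] for i in range(n)]
    for _ in range(max_iter):
        # Reduced matrix: communalities replace the unit diagonal.
        reduced = [[h2[i] if i == j else R[i][j] for j in range(n)] for i in range(n)]
        value, vec = first_eigenpair(reduced)
        loadings = [math.sqrt(max(value, 0.0)) * x for x in vec]
        new_h2 = [l * l for l in loadings]
        converged = max(abs(a - b) for a, b in zip(new_h2, h2)) < tol
        h2 = new_h2
        if converged:
            break
    if sum(loadings) < 0:  # fix the arbitrary sign of the eigenvector
        loadings = [-l for l in loadings]
    return loadings, h2
```

For a correlation matrix generated exactly by one factor (for example, loadings of .8, .7, .6, and .5), the routine should recover those loadings to within the convergence tolerance.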

Random sets of item responses were generated by simulating the responses of
220 students to 35 items such that the probability of a correct answer by any
simulee to an item was equal to the difficulty (proportion correct) of that
item. This was done separately for the pretest and the posttest. The same
procedures as were performed for the real data were carried out for
intercorrelating the item responses and factoring the resulting matrix. The
results of the factor analyses of real and random data were compared to
determine the number of "nonrandom" factors existing in the real data.
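A minimal sketch of how such random baseline data might be generated follows. The function names and the seeding convention are assumptions made for illustration; the original simulation program is not described beyond the rule that each simulee answers each item correctly with probability equal to that item's observed difficulty.

```python
import random

def simulate_random_responses(difficulties, n_students, seed=None):
    """Simulate independent 0/1 item responses.

    difficulties: proportion correct for each item (e.g., 35 values).
    Each simulee answers item j correctly with probability difficulties[j],
    independently of all other items, so no common factor is built in.
    """
    rng = random.Random(seed)
    return [[1 if rng.random() < p else 0 for p in difficulties]
            for _ in range(n_students)]

def proportion_correct(responses):
    """Observed difficulty (proportion correct) of each item."""
    n = len(responses)
    return [sum(row[j] for row in responses) / n
            for j in range(len(responses[0]))]
```

Because the simulated items are independent, any factor extracted from such data reflects sampling noise only, which is what makes it a useful yardstick for the real eigenvalues and loadings.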


The final factor solutions for the pretest and the posttest were then
compared in terms of numbers of factors extracted and the similarities between
them. Factor similarity was evaluated by computing the root-mean-square
deviation, the product-moment correlation coefficient, and the coefficient of
congruence between the factor loadings of the factors extracted at each test
administration (see Harman, 1976, pp. 343-344). These similarity measures were
compared with values obtained from the two sets of random data, as recommended
by Nesselroade and Baltes (1970).
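The three similarity indices can be written compactly. The sketch below uses their standard textbook definitions; the function names are hypothetical, and each function takes two equal-length lists of loadings for the same items.

```python
import math

def rms_deviation(a, b):
    """Root-mean-square deviation: sensitive to differences in absolute level."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

def pearson_r(a, b):
    """Product-moment correlation: sensitive only to the pattern of loadings."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    return cov / math.sqrt(sum((x - ma) ** 2 for x in a) *
                           sum((y - mb) ** 2 for y in b))

def congruence(a, b):
    """Coefficient of congruence: like a correlation, but loadings are not
    centered, so it reflects both level and pattern."""
    return sum(x * y for x, y in zip(a, b)) / math.sqrt(
        sum(x * x for x in a) * sum(y * y for y in b))
```

Shifting every loading by a constant leaves the correlation at 1 but raises the root-mean-square deviation and lowers the congruence coefficient, which is why the three indices are reported together.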
Total score differences. Frequency distributions of number-correct scores
for both administrations of the APT are presented in Appendix Table A; the
frequency polygons are displayed in Figure 1. This figure shows that although
the distribution of pretest scores was approximately symmetric, the
distribution of posttest scores was negatively skewed, indicating the presence
of a ceiling effect. Only four students answered all 35 items correctly on the
posttest; an additional 77 students (or 35%) incorrectly answered fewer than
four items. The mean score on the pretest was 22.26, the median was 22.74, and
the standard deviation was 5.97. For the posttest these statistics were 28.91,
30.10, and 4.88, respectively. A one-tailed t test for the difference between
means of dependent groups was calculated to be 18.67, with probability
< .0001.

Item difficulties. The differences in raw score distributions observed
between pretest and posttest were mirrored in the distributions of item
difficulties for the two administrations of the APT, as shown in Table 1.
Although the pretest items were, on the average, answered correctly more often
than not, nearly a third of them (i.e., 10 of 35) were answered incorrectly by
at least half of the students. For the posttest, however, only two of the
items were as difficult. In fact, one third of the items (12 of 35) were
answered correctly by more than 90% of the students.
Table 1
Frequency Distributions of Item Difficulties for APT
Administered as Pretest and as Posttest

Range of Item            Number of Items
Difficulty             Pretest     Posttest

 .00 -  .10               0           0
 .11 -  .20               1           0
 .21 -  .30               1           0
 .31 -  .40               4           0
 .41 -  .50               4           2
 .51 -  .60               5           0
 .61 -  .70               5           3
 .71 -  .80               6           9
 .81 -  .90               5           9
 .91 - 1.00               4          12

Mean Difficulty          .64         .83
Correlation between scores. The Pearson product-moment correlation
coefficient between number-correct scores at the two administrations of the
APT was .542. This relatively low value, coupled with the evidence of mean
score increases, reveals that students did not, to a great extent, maintain
their relative standings in the course after instruction.
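As a consistency check, the reported t statistic for dependent groups can be approximately reproduced from the summary statistics alone, because the standard deviation of the difference scores follows from the two standard deviations and the pretest-posttest correlation. The sketch below is an illustration only: the function name is hypothetical, and the rounded values reported in the text are used as inputs.

```python
import math

def dependent_t_from_summary(mean_pre, sd_pre, mean_post, sd_post, r, n):
    """t statistic for the mean of paired differences, from summary statistics.

    Var(post - pre) = sd_pre^2 + sd_post^2 - 2 * r * sd_pre * sd_post
    """
    sd_diff = math.sqrt(sd_pre ** 2 + sd_post ** 2 - 2 * r * sd_pre * sd_post)
    return (mean_post - mean_pre) / (sd_diff / math.sqrt(n))

# Rounded values reported for the APT (N = 220, r = .542).
t = dependent_t_from_summary(22.26, 5.97, 28.91, 4.88, 0.542, 220)
print(round(t, 1))
```

With these inputs the statistic comes out at about 18.7, in line with the reported value of 18.67; the small discrepancy is what would be expected from rounding in the published means, standard deviations, and correlation.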

Differences in the Structure of Achievement

Internal consistency reliability. The internal consistency reliability of
the APT, as indexed by coefficient alpha, was .836 for the pretest and .835
for the posttest. That the reliability coefficient remained essentially
constant provides some evidence for concluding that the items were functioning
together in the same manner before and after instruction. However, since the
variance of the scores decreased somewhat from pretest to posttest (see
Appendix Table A), the stability of coefficient alpha may actually reflect a
slight increase in the average interitem correlation.
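The link between coefficient alpha and the average interitem correlation can be made concrete. The sketch below computes alpha from a students-by-items matrix of 0/1 scores using the usual variance formula, and also inverts the standardized (Spearman-Brown) form of alpha to give the implied mean interitem correlation. The function names are hypothetical, and the standardized form assumes equal item variances, so the second function gives only a rough approximation for real test data.

```python
def variance(values):
    """Population variance of a list of numbers."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

def coefficient_alpha(responses):
    """Coefficient alpha for a students-by-items matrix of item scores."""
    k = len(responses[0])
    item_vars = [variance([row[j] for row in responses]) for j in range(k)]
    total_var = variance([sum(row) for row in responses])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

def mean_r_from_alpha(alpha, k):
    """Mean interitem correlation implied by standardized alpha:
    alpha = k * r / (1 + (k - 1) * r), solved for r."""
    return alpha / (k - alpha * (k - 1))
```

For a 35-item test, an alpha of about .84 corresponds under this approximation to a mean interitem correlation of roughly .13, which is consistent with the moderate single-factor loadings reported for the APT.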

Number of factors extracted. The eigenvalues and percent of total variance
accounted for by the first 15 factors from the APT and random data are given
in Appendix Table B. The plots of eigenvalues versus factors extracted for
both the APT and the random data are given in Figure 2a for the pretest and in
Figure 2b for the posttest. In both cases, there was one relatively strong
factor in the data; the eigenvalue for the first factor extracted from the APT
was much larger than the eigenvalues for the remaining factors in the APT and
for all the factors in the random data. The same cannot be said for any of the
remaining factors. It was concluded that a one-factor solution adequately
described the item response data from both the pretest and the posttest. The
FACTOR subroutine in SPSS (Nie et al., 1975) was then run again on the data
from each administration, specifying a single-factor solution each time.

Factor similarity. The factor loadings on the single factor extracted from
each administration of the APT and from corresponding random data are given in
Table 2. The loadings presented in Table 2 were of moderate magnitude; the
majority of the loadings were greater than .300, but all were less than .700.
The patterns and the magnitudes of the loadings were essentially the same
across test administrations. For example, Items 2 through 5 and Item 28 were
among the items with the lowest loadings at the pretest; the same was true for
these items at the posttest. The items with the highest loadings at the
pretest were also among the items with the highest loadings at the posttest.
That the magnitude of the loadings was similar for the two administrations can
also be seen by comparing the percentage of total variance accounted for by
each factor. The single factor extracted from the APT pretest data accounted
for 13.92% of the total variance compared to 3.05% for the random data. The
factor extracted from the APT posttest data was only slightly stronger,
accounting for 14.59% of the total variance as compared to 2.40% in the random
data.

Table 3 presents the measures of factor similarity between the APT factor
loadings at pretest and at posttest. The root-mean-square deviation between
the loadings extracted at each administration is sensitive to differences in
the absolute levels of the loadings; low values indicate only minor
differences between the values of the two sets of loadings. The
root-mean-square deviation was a low .089 for these data. The product-moment
correlation coefficient is
Figure 2
Eigenvalues for the First 15 Factors Extracted
from the APT and from Corresponding Random Data
Table 2
Factor Loadings on the Single Factor
Extracted from APT at Pretest and at Posttest,
and from Corresponding Random Data

                    Pretest                   Posttest
Item            APT    Random Data        APT    Random Data

  1            .289       .124           .303      -.042
  2            .088       .027          -.004       .130
  3            .058       .315           .152      -.049
  4            .160       .010           .219      -.051
  5            .191       .230           .226    [illegible]
  6            .263      -.187           .255       .172
  7            .332      -.188           .118       .032
  8        [illegible]    .147           .383       .036
  9            .156       .099           .341       .051
 10            .384       .150           .495      -.017
 11            .453      -.229           .253      -.277
 12            .372      -.178           .244      -.170
 13            .255       .007           .259      -.066
 14            .394       .345           .338       .136
 15            .376       .2?5           .440       .222
 16            .575      -.089           .545       .023
 17            .426       .075           .436      -.046
 18            .562      ?.285           .484       .071
 19            .491      -.136           .440       .330
 20            .588       .109           .506       .135
 21            .580       .029           .676       .025
 22            .460       .185           .418       .212
 23            .344      -.200           .378       .319
 24            .370       .402           .433       .084
 25            .338      -.028           .500       .051
 26            .460       .108           .560       .005
 27            .357      -.074           .467      -.015
 28            .117       .044           .141       .054
 29            .495       .042           .481       .044
 30            .291       .16?           .294       .196
 31            .292      -.276           .352       .006
 32            .378       .018           .386       .017
 33            .318       .084           .281       .195
 34            .313       .090           .359       .128
 35            .339       .153           .267      -.442

Percent of
Total Variance  13.92      3.05          14.59       2.40

sensitive only to differences in the patterns of the loadings and was equal to
.793. The coefficient of congruence is sensitive to differences in both the
level and the pattern of loadings and was a high .972. High values for these
latter two indices indicate a high degree of similarity between the two sets
of factor loadings. The three figures computed from the parallel random data
were .219, .067, and .118, respectively. It was concluded that the factors
extracted from each administration of the APT were nearly identical, both in
nature and in strength.



Table 3
Measures of Factor Similarity Between
Factor Loadings of APT at Pretest
and at Posttest and Between Factor Loadings
for Corresponding Random Data

Similarity Index                     APT     Random Data

Root-Mean-Square Deviation          .089        .219
Pearson Product-Moment Correlation  .793        .067
Coefficient of Congruence           .972        .118



Conclusions

Differences in Achievement Level Estimates

There was evidence in these data to conclude that there were gains in mean
achievement levels observed after a course of instruction. The difference
between the means of scores on the 35-item pretest and posttest was nearly 7
items; the frequency distribution of number-correct scores changed from a
symmetric distribution to one that was negatively skewed and displaced to the
right. This same effect was mirrored in the distributions of item
difficulties. The correlation between the two sets of number-correct scores
was .542, indicating that students did not generally maintain their relative
standings in the course after instruction. It is not known to what extent this
correlation was attenuated due to the ceiling effect observed for the posttest
scores.

Differences in the Structure of Achievement

Although there was definitive evidence of mean quantitative change from
pretest to posttest, there was no evidence of qualitative differences in the
factor structure underlying the item responses. The internal consistency
reliability of the test remained constant across administrations. When factor
analyses were performed separately on the pretest and posttest interitem
correlation matrices, essentially the same factor was extracted each time, as
evidenced by the similarity in the levels and pattern of factor loadings.

These data indicate, then, that students in the General College arithmetic
classes were indeed leaving the course with increased levels of the same
variable measured prior to instruction. The change that occurred within the
quarter was quantitative, not qualitative.
STUDY II

Method

Subjects

Data were collected from students enrolled in a general biology class at
the University of Minnesota during winter quarter of 1980. A paper-and-pencil
pretest was administered to all students present on the first day of class.
Computer-administered conventional posttests were given before classroom
midquarter and final examinations to volunteer students who were awarded
extra-credit points for their participation.



Design

Tests. There were two different tests administered at various times
throughout the quarter. Test A included 14 items from each of the three
content areas covered in class lectures before the midquarter exam (chemistry,
the cell, and energy). Test B included 14 items from each of the last three
content areas in the course (genetics, reproduction/embryology, and ecology).

Experimental groups. The data collection design for this study is shown in
Figure 3. Students were randomly assigned to two experimental groups, Groups 1
and 2, corresponding to the groups of students who were administered one of
two pretests, Tests A or B, respectively, on the first day of class. Group 3
included students who were absent for the first class meeting or who did not
record on their answer sheet which test they took.

Figure 3
Data Collection Design for Study II

                                        MQ          Final Exam
Group      Pretest                      Posttest    Posttest

Group 1    Test A: Content Areas 1-3    Test A      Test A
Group 2    Test B: Content Areas 4-6    Test A      Test B
Group 3    (none)                       Test A      Test B
During the two weeks immediately preceding the classroom midquarter
examination, volunteer students were administered conventional tests on the
computer (MQ posttest). All these students were administered Test A. During
the two weeks immediately preceding the final exam, volunteer students were
administered conventional tests on the computer (final exam posttest).
Students in Group 1 were readministered Test A; students in Groups 2 and 3
were administered Test B.

All item responses were coded as correct, incorrect, or missing. Missing
or omitted items did not present an important problem for this set of data.
Nevertheless, the same 15%-missing-data criterion was used here as was used in
the previous study: a student's response protocol was deleted from the data
set if the student omitted more than 6 (i.e., 15% of 42) items on any one
test. For the students included in the analysis, all missing data were coded
as incorrect.



Analyses

Differences in achievement level estimates: Test A. The question of
whether or not students' achievement level estimates on Test A increased from
the pretest to the MQ posttest could be answered by examining the performance
of Group 1 students on Test A at both testing occasions. However, the number
of students who took Test A both times was small (N = 102) compared to the
total number of students who took Test A at the pretest only (N = 276) and the
total number of students who took Test A at the MQ posttest only (N = 302). A
more powerful test of the difference in mean achievement levels could be
performed by combining the data from all students who took Test A at the MQ
posttest and by comparing their performance with that of all the students who
took Test A as a pretest.

For this comparison, it was necessary to assume that the three groups of
students being combined at the MQ posttest were equivalent. Group 1 students
were administered Test A both at the pretest and at the MQ posttest. (Although
Test A was also administered again at the final exam posttest, the number of
Group 1 students who returned to take Test A at the final exam posttest was
too small for meaningful comparisons to be made. Hence, Test A analyses were
confined to the pretest and MQ posttest administrations.) Performance of Group
1 students on Test A at the MQ posttest can be attributed to the students'
underlying ability, to the classroom instruction, and/or to the repetition of
items from one occasion to the next. Group 2 students, on the other hand, were
administered Test B as the pretest and were administered Test A for the first
time at the MQ posttest. Performance of Group 2 students on Test A, then,
could be attributed only to the students' underlying ability and/or to the
classroom instruction. For some Group 3 students (those who were absent on the
first day of class), performance on Test A could also be attributed to their
underlying ability and/or to the classroom instruction only. For the other
Group 3 students (those who did not record which pretest they took), however,
Test A performance could be attributed to their underlying ability, to the
classroom instruction, and/or to item repetition. Since these two subgroups of
Group 3 students could not be identified and separated for analysis, however,
Group 3 was omitted from the following comparison for Test A.



Because students were randomly assigned to Groups 1 and 2 on the first day
of class, and because classroom instruction was the same for all students, any
differences observed between Groups 1 and 2 on their performance on Test A
would reflect a repetition-of-items effect. If mean test scores of Groups 1
and 2 were not significantly different from each other, then Groups 1 and 2
could be combined at the MQ posttest and compared with all students from Group
1 at the pretest. If a significant repetition-of-items effect were found, then
subsequent analyses should be performed only on the data from those students
in Group 1. Differences between the scores of Group 1 and Group 2 students
were evaluated by the use of a t test for the difference between two
independent groups and by the Kolmogorov-Smirnov two-sample test for the
difference between two frequency distributions.
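The two test statistics used for the repetition-of-items comparison can be sketched as follows. This is an illustrative standard-library version with hypothetical function names; it returns only the statistics (not their probabilities), and it assumes the pooled-variance form of the independent-groups t test, which the text does not specify.

```python
import math

def pooled_t(sample1, sample2):
    """t statistic for two independent groups, pooled-variance form (assumed)."""
    n1, n2 = len(sample1), len(sample2)
    m1, m2 = sum(sample1) / n1, sum(sample2) / n2
    ss1 = sum((x - m1) ** 2 for x in sample1)
    ss2 = sum((x - m2) ** 2 for x in sample2)
    pooled_var = (ss1 + ss2) / (n1 + n2 - 2)
    return (m1 - m2) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))

def ks_statistic(sample1, sample2):
    """Kolmogorov-Smirnov two-sample D: the largest absolute difference
    between the two empirical cumulative distribution functions."""
    n1, n2 = len(sample1), len(sample2)
    d = 0.0
    for x in sorted(set(sample1) | set(sample2)):
        f1 = sum(v <= x for v in sample1) / n1
        f2 = sum(v <= x for v in sample2) / n2
        d = max(d, abs(f1 - f2))
    return d
```

The t test addresses only a difference in means, whereas the Kolmogorov-Smirnov statistic responds to any difference in the shape or location of the two score distributions, which is presumably why both were applied.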

Analyses relevant to the issue of differences in achievement scores
included examination of the frequency distributions and summary statistics of
number-correct scores and the distributions of item difficulties from the
pretest and the MQ posttest.

Differences in the structure of achievement: Test A. The question of
whether or not there were qualitative changes in the nature of achievement
test scores due to instruction was again investigated, as in Study I, by
analysis of internal consistency reliability coefficients and by separate
principal axes factor analyses. These analyses were performed separately on
the pretest and MQ posttest interitem correlation matrices, with communalities
estimated using an iterative procedure, as described in Study I. The number of
nonrandom factors was again determined by comparing the results of the factor
analyses of Test A data with the results of factor analyses of random data
based on items of similar difficulty.

The results of the final solutions from the pretest and the MQ posttest
were then compared in terms of the numbers of factors extracted and the
similarity of these factors. As in Study I, factor similarity was indexed by
the root-mean-square deviation, the product-moment correlation coefficient,
and the coefficient of congruence between the factor loadings obtained at each
occasion in comparison with values obtained from two sets of random data.

Differences in achievement level estimates: Test B. The question of
whether or not students' achievement level estimates on Test B increased from
the pretest to the final exam posttest could be answered by examining the
performance of Group 2 students on Test B at both testing occasions. However,
if no significant repetition-of-items effect was found for Test A (as
discussed above), the assumption could be made that there would be no
repetition-of-items effect for Test B; then there would be justification for
combining the data on Test B from Groups 2 and 3 at the final exam in order to
conduct a more powerful test of the difference between mean achievement level
estimates. Analyses relevant to this question included examination of the
frequency distributions and summary statistics of number-correct scores, and
the distributions of item difficulties from the pretest and the final exam
posttest.

Differences in the structure of achievement: Test B. As described above,
the internal consistency reliability coefficient (coefficient alpha) was
computed for Test B at the pretest and at the final exam posttest. Separate
principal-axes factor analyses were also performed on the Test B data and on
parallel random data. The final factor solutions of Test B from the pretest and
the final exam posttest were also compared in terms of the number of factors
extracted and the similarity of these factors, as was done in Study I and for
Test A in this study.

Results

Effect of Item Repetition 

The effect on achievement level estimates of repeating items from the
pretest to a posttest was evaluated by comparing the performance of students in
Groups 1 and 2 on Test A administered before the midquarter exam (MQ posttest).
There were 102 students from Group 1 who volunteered to take the MQ posttest, of
whom 98 met the 15%-missing-data criterion and were retained for analyses. For
Group 2 these figures were 101 and 91, respectively.

Appendix Table C presents the frequency distributions of number-correct
scores for Test A administered at the MQ posttest to students from Groups 1 and
2; the frequency polygons are displayed in Figure 4. For Group 1 the mean test
score was 24.19, the median was 23.79, and the standard deviation was 5.87. For
Group 2 these statistics were 22.59, 21.80, and 6.26, respectively. A t test of
the difference between the means of independent groups was calculated to be
1.98; this was not statistically significant at p = .01. The entire frequency
distributions of Groups 1 and 2 were compared by using a Kolmogorov-Smirnov
two-sample test; the statistic calculated was equal to 7.86, which was not
statistically significant at p = .01.

Figure 4 

Grouped Frequency Distributions of Number-Correct Scores 
for Biology Test A Administered at MQ Posttest 
for Groups 1 and 2

Although the observed differences were in the predicted direction, the
effect of item repetition was not statistically significant. Hence, the question
of identifying and separating the two subgroups of Group 3 was no longer
relevant, and the Test A MQ posttest scores of students in Groups 1, 2, and 3
were combined for comparison with the scores of all students who took Test A on
the first day of class. Since some of the students who took the test at the
pretest did not take it at the posttest, the correlation between scores at
pretest and posttest was not computed.

Missing Data

There were 276 students who were administered Test A at the pretest; of
these, 272 met the 15%-missing-data criterion and were retained for further
analyses. The combined total of students who took Test A at the MQ posttest was
302, and 283 of these were retained for further analyses.

Because there was no effect of item repetition observed for Test A, the
performance of Group 2 students who were administered Test B at the pretest was
compared with the performance of students from both Groups 2 and 3 who were
administered Test B at the final exam posttest. There were 283 students who were
administered Test B at the pretest, of whom 277 met the 15%-missing-data
criterion and were retained for further analyses. A total of 169 students took
Test B at the final exam posttest, and 163 of them were retained for further
analyses.


Differences in Achievement Level Estimates: Test A 


Total score differences. Frequency distributions of number-correct scores
on Test A at both testing occasions are presented in Appendix Table D; the
frequency polygons appear in Figure 5. Both distributions are approximately
symmetric, with the distribution of MQ posttest scores displaced to the right.
The mean of the pretest scores was 15.97, with a standard deviation of 3.97. For
the MQ posttest scores, these figures were 23.46 and 5.99, respectively. The
mean score difference between the two occasions was 7.49. Because there was
some overlap between the students in the two groups, the groups were not
strictly independent, nor were they strictly dependent. A t test for the
difference between two independent means, although technically inappropriate,
would yield a conservative test of the significance of this difference. This
test resulted in t(df = 553) = 17.34, p < .001.
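As a check on the arithmetic, an independent-groups t statistic can be recomputed from the summary statistics alone. The sketch below assumes the pooled-variance form of the test (the report does not state which form was used) and the retained sample sizes of 272 and 283; the function name is illustrative. Under those assumptions it reproduces the reported degrees of freedom and a t value close to the reported one; small discrepancies are to be expected because the published means and standard deviations are rounded.

```python
import math

def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Independent-groups t with a pooled variance estimate; returns (t, df)."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    std_error = math.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
    return (mean2 - mean1) / std_error, df

# Test A: pretest (n = 272) versus MQ posttest (n = 283), values from the text.
t_value, df = pooled_t(15.97, 3.97, 272, 23.46, 5.99, 283)
print(round(t_value, 2), df)
```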

Item difficulties. The frequency distributions of item difficulties for
Test A at both testing occasions are given in Table 4. As indicated earlier,
the pretest was somewhat difficult: 74% of the items were answered correctly by
less than half the students, and no item was answered correctly more than 80% of
the time. After instruction, more than half the items (23 of 42) were answered
correctly by 51% to 90% of the students, although five items were answered
correctly less than 30% of the time.
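The tabulation of item difficulties can be sketched as follows. Item difficulty is taken here as the proportion of students answering the item correctly, and the ten ranges (.00-.10, .11-.20, ..., .91-1.00) follow the layout of the difficulty tables. The response matrix and the function names are hypothetical.

```python
def item_difficulties(responses):
    """Proportion correct for each item (rows = students, columns = items)."""
    n = len(responses)
    n_items = len(responses[0])
    return [sum(row[j] for row in responses) / n for j in range(n_items)]

def difficulty_bin(p):
    """Index of the range .00-.10, .11-.20, ..., .91-1.00 that contains p."""
    pct = round(p * 100)
    return 0 if pct <= 10 else (pct - 1) // 10

def tally(difficulties):
    """Number of items falling in each of the ten difficulty ranges."""
    counts = [0] * 10
    for p in difficulties:
        counts[difficulty_bin(p)] += 1
    return counts

# Hypothetical 0/1 responses of four students to three items.
responses = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
p_values = item_difficulties(responses)
counts = tally(p_values)
print(p_values, counts)
```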

Differences in the Structure of Achievement: Test A

Internal consistency reliability. Coefficient alpha for Test A when
administered on the first day of class was .490. This low value indicates that
the average interitem correlation was correspondingly small. After instruction,
coefficient alpha increased to .787 for the same set of items. Although this
value is not high for a 42-item test, it represents a substantial increase over
the value obtained at the pretest. The difference between these two figures may
indicate that the items were functioning as a set differently after instruction
than they were before instruction, and/or it may reflect the increase in the
variance of the number-correct scores.
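Coefficient alpha depends on the ratio of the summed item variances to the variance of the number-correct scores, which is why an increase in total-score variance can raise alpha. A minimal sketch of the standard formula follows; the function name and the toy response matrix are assumptions for illustration.

```python
from statistics import variance

def coefficient_alpha(responses):
    """Cronbach's coefficient alpha (rows = students, columns = items)."""
    k = len(responses[0])
    item_variances = [variance([row[j] for row in responses]) for j in range(k)]
    total_variance = variance([sum(row) for row in responses])
    return (k / (k - 1)) * (1.0 - sum(item_variances) / total_variance)

# Hypothetical 0/1 responses of four students to three items.
responses = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
alpha = coefficient_alpha(responses)
print(round(alpha, 3))
```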

Figure 5

Grouped Frequency Distributions of Number-Correct Scores
for Biology Test A Administered at Pretest and at MQ Posttest

Table 4

Frequency Distributions of Item
Difficulties for Biology Test A
Administered at Pretest
and at MQ Posttest

Range of Item            Number of Items
Difficulty              Pretest   Posttest

 .00 -  .10                1         1
 .11 -  .20                8         1
 .21 -  .30                8         3
 .31 -  .40                9         ?
 .41 -  .50                5         ?
 .51 -  .60                4         ?
 .61 -  .70                2         5
 .71 -  .80                5         5
 .81 -  .90                0         8
 .91 - 1.00                0         0

Mean Difficulty           .38       .56

Note. ? = entry illegible in the source copy.

Figure 6

Eigenvalues for the First 15 Factors Extracted from Test A
Administered at Pretest and at MQ Posttest,
and from Corresponding Random Data

Number of factors extracted. Appendix Table E presents the eigenvalues and
percent of total variance accounted for by the first 15 factors from Test A and
from corresponding random data. Figure 6a presents the plots of eigenvalues
versus factors extracted from Test A and from random data at the pretest, and
Figure 6b presents results for the MQ posttest. Comparison of the results from
Test A with the results from the corresponding random data revealed that there
was one weak factor present in the pretest and one stronger factor present in
the posttest.

Factor similarity. Table 5 presents the factor loadings on the single
factor extracted at each testing occasion from Test A and from corresponding
random data. Comparison of these factor loadings reveals that the loadings from
the MQ posttest were, in general, higher than those from the pretest. No loading
from the pretest was greater than [value illegible], and nearly two-thirds of
the factor loadings (26 of 42) were less than .200. For the MQ posttest, the
highest loading was .502, but 81% of the factor loadings (34 of 42) were greater
than .200.

This result can also be seen by comparing the percentages of total variance
accounted for by the single factor at each administration. For the pretest that
figure was 3.96% (as compared to 2.88% for the random data); for the MQ posttest
the factor accounted for 9.36% of the total variance (as compared to 2.79% for
the random data). Both of these percentages are small for a 42-item test,
indicating that the factor was relatively weak, even at the MQ posttest.

The pattern of factor loadings did not appear to be consistent across test
administrations. The items with the lowest loadings at the pretest did not
emerge as the items with the lowest loadings at the MQ posttest, and the same
was true for the items with the highest loadings.

Table 6 presents the measures of factor similarity between the two sets of
loadings for Test A and the corresponding random data. The root-mean-square
deviation between the two sets of loadings for Test A, sensitive to differences
in levels of the loadings, was .195, a high value when considered in conjunction
with the relatively narrow range of loadings observed in these data. The
product-moment correlation coefficient between the loadings, sensitive to
pattern differences, was a low .373. The coefficient of congruence was .780. The
similarity measures obtained from the random data were .160, .549, and .548,
respectively. All these figures reveal that the factors extracted from Test A on
the two occasions were not substantially more similar than were factors
extracted from randomly generated data.

These data reveal, then, that the factor extracted from Test A at the
pretest differed substantially from that extracted at the MQ posttest. Although
there was a sizeable increase in the number-correct scores after instruction,
there was a corresponding change in the first factor underlying the item
responses. This indicates that the pretest and the MQ posttest measured quite
different variables, even though they were composed of exactly the same items.

Differences in Achievement Level Estimates: Test B 

Total score differences. Frequency distributions of number-correct scores
on Test B at both testing occasions are given in Appendix Table F; their
frequency polygons are presented in Figure 7. The distribution of final exam
posttest scores is approximately symmetric, while that of the pretest scores is
slightly positively skewed. The mean of the pretest scores was 15.18, with
standard deviation 3.54. For the final exam posttest scores, these figures were
21.47 and 4.58, respectively. The difference between the mean scores on the two
occasions was 6.29. As before, a t test for the difference between two
independent means, though technically inappropriate, was conducted as a
conservative test of this difference; here, t(df = 438) = 16.15, p < .001.

Table 5

Factor Loadings on the Single Factor
Extracted from Biology Test A at Pretest and at MQ Posttest,
and from Corresponding Random Data

                    Pretest                  Posttest
Item          Test A   Random Data     Test A   Random Data

  1            .068      -.032          .186       .158
  2            .024      -.026          .133      -.205
  3            .331      -.245          .161       .051
  4            .115       .163          .279       .150
  5           -.002      -.238          .276      -.099
  6            .2?6      -.054          .008       .029
  7            .280       .191          .372       .121
  8            .191      -.246          .333      -.153
  9            .272       .096          .408       .120
 10            .027      -.005          .367      -.002
 11            .291      -.163          .154      -.154
 12            .103      -.035          .207       .011
 13            .370       .327          .502       .208
 14            .391      -.197          .344      -.223
 15            .042       .440          .38?       .418
 16            .273      -.010          .341       .29?
 17            .133      -.042          .335       .079
 18            .239      -.105          .310      -.162
 19            .388       .021          .276       .162
 20            .205       .362          .410       .222
 21            .115      -.059          .316      -.098
 22            .223      -.040          .479      -.161
 23            .383       .060          .298       .024
 24            .245       .067          .373      -.114
 25            .052      -.053          .228       .18?
 26           -.024      -.116          .246      -.105
 27            .039       .091          .478       .083
 28            .015      -.094          .143       .060
 29            .117       .061           ?         .244
 30            .343      -.139          .372      -.224
 31            .095       .070          .200       .057
 32            .194      -.027          .284      -.154
 33            .043       .179          .272       .255
 34            .059      -.050          .249       .337
 35            .096      -.150          .301       .190
 36           -.026       .148          .245       .206
 37            .221      -.139          .340      -.021
 38            .107      -.185          .227      -.095
 39            .106       .282          .241      -.016
 40           -.111      -.344         -.030       .077
 41           -.124      -.162          .164      -.041
 42            .063       .113          .422       .117

Percent of
Total Variance 3.96       2.88          9.36       2.79

Note. ? = digit or entry illegible in the source copy.

Table 6

Measures of Factor Similarity Between Factor
Loadings for Test A at Pretest and at MQ
Posttest, and Between Factor Loadings
from Corresponding Random Data

Similarity Index                Test A    Random Data

Root-Mean-Square
  Deviation                      .195        .160
Pearson Product-Moment
  Correlation                    .373        .549
Coefficient of
  Congruence                     .780        .548



Figure 7

Grouped Relative Frequency Distributions of Number-Correct Scores
for Biology Test B Administered at Pretest and at Final Exam Posttest

[Frequency polygons for the pretest and the final exam posttest; horizontal
axis: Number-Correct Score]



Item difficulties. The frequency distributions of item difficulties for
Test B at both testing occasions are given in Table 7. As was observed for the
number-correct scores, the pattern of item difficulties reveals that the pretest
was somewhat difficult: 74% of the items were answered correctly by less than
half the students, and only two items were answered correctly more than 80% of
the time. At the end of the course, more than half the items (22 of 42) were
answered correctly by the majority of students, although 12 items were answered
correctly less than 30% of the time.



Table 7

Frequency Distributions of Item
Difficulties for Biology Test B
Administered at Pretest and
at Final Exam Posttest

Range of Item            Number of Items
Difficulty              Pretest   Posttest

 .00 -  .10                4         2
 .11 -  .20                9         3
 .21 -  .30                8         7
 .31 -  .40                3         4
 .41 -  .50                7         4
 .51 -  .60                5         2
 .61 -  .70                2        10
 .71 -  .80                2         5
 .81 -  .90                2         4
 .91 - 1.00                0         1

Mean Difficulty           .36       .51



Differences in the Structure of Achievement: Test B

Internal consistency reliability. When administered at the pretest on the
first day of class, coefficient alpha for Test B was .398, increasing to .630
when administered at the final exam posttest. These low values indicate that
the average interitem correlation coefficient was correspondingly small. Even
though both reliability coefficients were relatively low, the fact that the
reliability coefficient increased from .40 to .63 may be an indication that the
items were functioning as a set differently after instruction than they were
before instruction. As before, however, this increase may simply reflect the
increase in the variance of the test scores.



Number of factors extracted. Appendix Table G presents the eigenvalues and
percentages of total variance accounted for by the first 15 factors extracted
from Test B and from corresponding random data. Figure 8a presents the plots of
these eigenvalues versus factors extracted at the pretest, and Figure 8b
presents similar data from the final exam posttest. Comparison of the results
from the real data with the results from the random data reveals that there was
no factor stronger than one extracted from the random data in the pretest, but
one stronger factor was extracted from Test B at the final exam posttest.

Factor similarity. Table 8 presents the factor loadings on the single
factor extracted at each testing occasion from Test B and from corresponding
random data. Comparison of these factor loadings reveals that the loadings from
the final exam posttest were, in general, slightly higher than those from the
pretest. The highest pretest loading was .459, and nearly two-thirds of the
factor loadings (27 of 42) were less than .200. For the final exam posttest, the
highest loading was .438, but more than half of the factor loadings (23 of 42)
were greater than .200.

Figure 8

Eigenvalues for the First 15 Factors Extracted from Biology Test B
Administered at Pretest and at Final Exam Posttest,
and from Corresponding Random Data

(a) Pretest
(b) Final Exam Posttest

Table 8

Factor Loadings on the Single Factor Extracted
from Biology Test B at Pretest and at Final Exam Posttest,
and from Corresponding Random Data

                    Pretest                  Posttest
Item          Test B   Random Data     Test B   Random Data

  1            .131        ?             ?          ?
  2            .073        ?             ?          ?
  3           -.023        ?             ?          ?
  4            .218        ?             ?          ?
  5            .252      -.286           ?          ?
  6            .268       .145           ?          ?
  7            .191       .145           ?          ?
  8             ?          ?             ?          ?
  9             ?          ?             ?          ?
 10            .32?        ?             ?          ?
 11             ?         .471           ?          ?
 12            .164       .117          .311        ?
 13            .393      -.111          .371        ?
 14           -.007      -.136          .438        ?
 15            .228      -.085          .261        ?
 16            .329      -.099          .301        ?
 17            .246        ?             ?          ?
 18            .154       .381           ?          ?
 19            .192        ?            .241        ?
 20           -.027       .341          .193        ?
 21            .231      -.151          .307        ?
 22           -.239      -.156          .268        ?
 23            .459       .213          .299        ?
 24            .062       .067          .079       .140
 25            .009       .182          .330        ?
 26            .045      -.10?          .174      -.044
 27           -.101       .034         -.112      -.057
 28             ?          ?             ?          ?
 29            .296      -.245          .084        ?
 30             ?          ?             ?          ?
 31            .252       .179          .397       .003
 32            .278       .020          .177      -.197
 33           -.045       .045           ?        -.082
 34            .028      -.277          .137       .003
 35            .012       .3?4          .165       .093
 36            .166        ?           -.071       .047
 37           -.115      -.034         -.023      -.026
 38            .018        ?           -.002       .009
 39            .082       .12?          .011       .053
 40            .040       .1?9          .178      -.088
 41            .013      -.457          .105        ?
 42           -.058       .510         -.111      -.071

Percent of
Total Variance 3.69       4.70          5.96       2.54

Note. ? = digit or entry illegible in the source copy.

This result can also be seen by comparing the percentage of total variance
accounted for by the single factor extracted at each administration. For the
pretest, that figure was 3.69% (as compared to 4.70% accounted for by the random
factor); for the final exam posttest, the factor accounted for 5.96% of the
total variance (as compared to 2.54% for the random data). Both of these
percentages are very small, indicating that the factor was relatively weak.

The pattern of factor loadings did not appear consistent across test
administrations. The items with the lowest loadings at the pretest did not
necessarily emerge as the items with the lowest loadings at the final exam
posttest, and the same was true for the items with the highest loadings.

Table 9 presents the measures of factor similarity for Test B. The
root-mean-square deviation between the two sets of loadings for Test B,
sensitive to differences in levels of the loadings, was .177, a high value when
considered in conjunction with the relatively narrow range of loadings observed
in these data but lower than the .300 observed for the two sets of random data.
The product-moment correlation coefficient between the loadings, sensitive to
pattern differences, was a low .399, as contrasted with r = -.327 for the random
data. The coefficient of congruence was .697 for Test B and -.255 for the random
data. Although the comparison of the similarity measures reveals that the factor
loadings for Test B were more congruent than the corresponding sets of random
data, the degree of similarity was so low that these factors could not
justifiably be considered congruent.

Table 9

Measures of Factor Similarity Between Factor
Loadings from Test B at Pretest and at Final
Exam Posttest, and Between Factor Loadings
from Corresponding Random Data

Similarity Index                Test B    Random Data

Root-Mean-Square
  Deviation                      .177        .300
Pearson Product-Moment
  Correlation                    .399       -.327
Coefficient of
  Congruence                     .696       -.255



These data reveal, then, that the factor extracted from Test B at the
pretest differed from the factor extracted at posttest. As was observed for Test
A, there was a sizeable increase in the number-correct scores, accompanied by a
change in the factor underlying the item responses. This indicates that the
pretest and the final exam posttest were measuring quite different variables,
even though they were composed of exactly the same items.



Conclusions 

Differences in Achievement Level Estimates

The results from both Test A and Test B indicate that there were mean
differences in achievement level estimates (number-correct scores) that
accompanied classroom instruction. On the average, test scores increased after
relevant course instruction; for these data, scores increased between 6 and 7.5
points on a 42-item test. The increases in these test scores were not
attributable to the effect of item repetition. Although the differences were in
the predicted direction, neither a t test nor the Kolmogorov-Smirnov two-sample
test was significant at p = .01.

Differences in the Structure of Achievement 

There were substantial differences in the structure of item responses to
the items on both biology tests (Test A and Test B) from the pretest to the
posttest. Large increases in the internal consistency reliability coefficient
may reflect corresponding changes in the average interitem correlation
coefficients. That is, changes in the way the items functioned together as a set
were evident after instruction took place. This same effect was observed when
the factor structures of the tests at both administrations were compared.
Although only one factor was extracted at each administration of each test, the
factor at each pretest was very weak and bore little relationship to the factor
extracted later in the course, as reflected in the patterns and levels of the
factor loadings.

DISCUSSION AND CONCLUSIONS 

The results of these studies show that the use of simple difference scores
to measure change in classroom achievement may not be appropriate for all
subject matter areas. The use of simple difference scores, or some derivative
thereof, assumes that there is only a quantitative difference between pretest
and posttest achievement levels due to a course of instruction. That is, the
assumption is made that a pretest measures a baseline amount of some knowledge
or trait and that classroom instruction results in increased levels of the same
trait, as indicated by higher scores on the same, or a similar, test.

This assumption was supported by the results of the mathematics data.
There was a large and statistically significant difference observed in
achievement test scores obtained before and after instruction. That the same
trait was being measured both times was indicated by the high degree of
similarity of the underlying factor structure of the test when examined at both
points in time. The only change observed in the mathematics test scores was,
then, a quantitative one, reflected in increases in mean number-correct score
after classroom instruction in mathematics.

The results were quite different for the two biology tests examined.
Factor analyses of the pretests revealed the presence of one very weak factor
for each pretest. One slightly stronger factor also emerged at each of the
posttests, but there was very little correspondence between the pretest and
posttest factors. Even though mean test scores increased after instruction,
there was a corresponding difference in the factors underlying test performance.
The change that occurred in the biology test scores, then, was a qualitative
one, where the tests were measuring different variables before and after
instruction. Evaluating gains in achievement by computing pretest-posttest
difference scores cannot be justified under these circumstances.

That the results from these two studies are different has important bearing
on the issue of program evaluation and the measurement of change. The question
of whether the difference in test scores that follows classroom instruction or
program participation is quantitative or qualitative must be answered before any
attempt at quantifying change can legitimately be made. For some courses of
instruction, the application of classical change-score methodology may be
defended on the grounds that the only change observed was quantitative; for
others, the use of such methodology may not be justified. Clearly, further
research is needed to define those areas where the use of change scores or their
derivatives may be warranted.

Appendix: Supplementary Tables

Table A

Frequency Distributions of Number-Correct Scores
for APT Pretest and Posttest (N = 220)

                  Pretest                           Posttest
                              Cumulative                        Cumulative
Score   Frequency   Percent    Percent    Frequency   Percent    Percent

 35         0         0.0       100.0         4         1.8       100.0
 34         1         0.5       100.0        20         9.1        98.2
 33         4         1.8        99.5        28        12.7        89.1
 32         7         3.2        97.7        29        13.2        76.4
 31         7         3.2        94.5        19         8.6        63.2
 30        13         5.9        91.4        25        11.4        54.5
 29         5         2.3        85.5        16         7.3        43.2
 28        13         5.9        83.2        19         8.6        35.9
 27         5         2.3        77.3        11         5.0        27.3
 26         8         3.6        75.0         8         3.6        22.3
 25        14         6.4        71.4         7         3.2        18.6
 24        20         9.1        65.0         7         3.2        15.5
 23        17         7.7        55.9         6         2.7        12.3
 22        10         4.5        48.2         1         0.5         9.5
 21        14         6.4        43.6         5         2.3         9.1
 20        16         7.3        37.3         4         1.8         6.8
 19         6         2.7        30.0         1         0.5         5.0
 18        11         5.0        27.3         3         1.4         4.5
 17        11         5.0        22.3         1         0.5         3.2
 16         9         4.1        17.3         0         0.0         2.7
 15         7         3.2        13.2         0         0.0         2.7
 14         2         0.9        10.0         1         0.5         2.7
 13         4         1.8         9.1         2         0.9         2.3
 12         4         1.8         7.3         1         0.5         1.4
 11         7         3.2         5.5         0         0.0         0.9
 10         1         0.5         2.3         1         0.5         0.9
  9         0         0.0         1.8         0         0.0         0.5
  8         3         1.4         1.8         1         0.5         0.5
  7         1         0.5         0.5         0         0.0         0.0

Mean              22.26                             28.91
SD                 5.97                              4.88
Median               ?                              30.10
Mode                 24                                32

Note. ? = entry illegible in the source copy.

Table B

Eigenvalues and Percent of Total Variance
Accounted for by First 15 Factors Extracted from the APT
at Pretest and at Posttest, and from Corresponding Random Data

                    Pretest                               Posttest
              APT           Random Data             APT           Random Data
         Eigen-  % Total   Eigen-  % Total     Eigen-  % Total   Eigen-  % Total
Factor   value   Variance  value   Variance    value   Variance  value   Variance

  1      5.350    15.3     1.545     4.4       5.590    16.0     1.419     4.1
  2      1.555     4.4     1.308     3.5       1.605     4.6     1.253     3.6
  3      1.539     4.4     1.229      ?        1.337     3.8     1.161     3.3
  4      1.209     3.5     1.139     3.3       1.171     3.3     1.134     3.2
  5      1.086     3.1     1.029     2.9       1.034     3.0     1.052     3.0
  6      1.016     2.9      .993     2.8       1.006     2.9     1.023     2.9
  7       .942     2.7      .890     2.5        .986     2.8      .896     2.6
  8       .892     2.5      .865     2.5        .939     2.7      .828     2.4
  9       .876     2.5      .822     2.3        .839     2.4      .814     2.3
 10       .794     2.3      .767     2.2         ?       2.3      .790     2.3
 11       .739     2.1      .745     2.1        .756     2.2      .770     2.2
 12       .666     1.9      .692     2.0        .675     1.9      .732     2.1
 13       .607     1.7      .634     1.8        .660     1.9      .702     2.0
 14       .597     1.7      .600     1.7        .604     1.7      .666     1.9
 15       .553     1.6      .566     1.6        .533      ?       .600     1.7

Note. ? = entry illegible in the source copy.

Table C

Frequency Distribution of Number-Correct Scores for
Biology Test A at MQ Posttest, for Students in Groups 1 and 2

               Group 1 (N = 98)                  Group 2 (N = 91)
                              Cumulative                        Cumulative
Score   Frequency   Percent    Percent    Frequency   Percent    Percent

 41         1         1.0       100.0         0         0.0       100.0
 40         0         0.0        99.0         0         0.0       100.0
 39         0         0.0        99.0         0         0.0       100.0
 38         0         0.0        99.0         1         1.1       100.0
 37         2         2.0        99.0         1         1.1        98.9
 36         1         1.0        96.9         0         0.0        97.8
 35         0         0.0        95.9         1         1.1        97.8
 34         1         1.0        95.9         3         3.3        96.7
 33         2         2.0        94.9         1         1.1        93.4
 32         3         3.1        92.9         2         2.2        92.3
 31         2         2.0        89.8         4         4.4        90.1
 30         5         5.1        87.8         1         1.1        85.7
 29         6         6.1        82.7         3         3.3        84.6
 28         4         4.1        76.5         1         1.1        81.3
 27         5         5.1        72.4         0         0.0        80.2

Table D

Frequency Distribution of Number-Correct Scores
for Biology Test A at Pretest and at MQ Posttest

Table E

Eigenvalues and Percent of Total Variance Accounted for by
First 15 Factors Extracted from Biology Test A at Pretest
and at MQ Posttest and from Corresponding Random Data

Table F

Frequency Distribution of Number-Correct Scores
for Biology Test B at Pretest and at Final Exam Posttest

Table G

Eigenvalues and Percent of Total Variance Accounted for by First
15 Factors Extracted from Biology Test B at Pretest and at Final Exam
Posttest and from Corresponding Random Data
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